Session 1: Scraping Interactive Web Pages
2023-07-28
In this session, we learn how to hunt down wild data from interactive web pages.
Initially we were planning to scrape researchgate.net, since it contains self-created profiles of many researchers. However, when you try to get the html content:
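The failing call was presumably a plain rvest request along these lines (a sketch, assuming read_html() was used on the same profile URL that appears in the cURL call further down):

```r
library(rvest)

# requesting the profile page directly; the server blocks this
html <- read_html("https://www.researchgate.net/profile/Johannes-Gruber-2")
```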
Error in open.connection(x, "rb"): HTTP error 403.
If you don’t know what an HTTP error means, you can go to https://http.cat and have the status explained in a fun way. Below I use a little convenience function:
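A minimal sketch of such a helper (the name http_cat is an assumption, not from the original):

```r
# hypothetical helper: opens the https://http.cat explanation for a status code
http_cat <- function(status) {
  browseURL(paste0("https://http.cat/", status))
}

http_cat(403)
```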
Open the Inspect Window Again:
But this time, we focus on the Network tab:
Here we get an overview of all the network activity of the browser and the individual requests for data that are performed. Clear the network log first and reload the page to see what is going on. Finding the right call is not always easy, but in most cases, we are looking for the request that actually delivers the data displayed on the page (often a JSON response).
Once you have identified the call, you can right-click -> Copy -> Copy as cURL.
cURL Calls

What is cURL?

cURL is a library that can make HTTP requests.

-H arguments describe the headers, which are arguments sent with the call
-d is the data or body of a request, which is used e.g., for uploading things
-o/-O can be used to write the response to a file (otherwise the response is returned to the screen)
--compressed means to ask for a compressed response which is unpacked locally (saves bandwidth)

curl 'https://www.researchgate.net/profile/Johannes-Gruber-2' \
-H 'authority: www.researchgate.net' \
-H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7' \
-H 'accept-language: en-GB,en;q=0.9' \
-H 'cache-control: max-age=0' \
-H '[Redacted]' \
-H 'sec-ch-ua: "Chromium";v="115", "Not/A)Brand";v="99"' \
-H 'sec-ch-ua-mobile: ?0' \
-H 'sec-ch-ua-platform: "Linux"' \
-H 'sec-fetch-dest: document' \
-H 'sec-fetch-mode: navigate' \
-H 'sec-fetch-site: cross-site' \
-H 'sec-fetch-user: ?1' \
-H 'upgrade-insecure-requests: 1' \
-H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36' \
--compressed

httr2::curl_translate()

With httr2::curl_translate(), R is no different from a regular browser. Before pasting the call into the function, escape the mischievous " characters: press ctrl + F to open the Find & Replace tool, put " in the find and \" in the replace field, and go through all matches except the first and last:

library(httr2)
httr2::curl_translate(
"curl 'https://www.researchgate.net/profile/Johannes-Gruber-2' \
-H 'authority: www.researchgate.net' \
-H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7' \
-H 'accept-language: en-GB,en;q=0.9' \
-H 'cache-control: max-age=0' \
-H 'cookie: [Redacted]' \
-H 'sec-ch-ua: \"Chromium\";v=\"115\", \"Not/A)Brand\";v=\"99\"' \
-H 'sec-ch-ua-mobile: ?0' \
-H 'sec-ch-ua-platform: \"Linux\"' \
-H 'sec-fetch-dest: document' \
-H 'sec-fetch-mode: navigate' \
-H 'sec-fetch-site: cross-site' \
-H 'sec-fetch-user: ?1' \
-H 'upgrade-insecure-requests: 1' \
-H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36' \
--compressed"
)

request("https://www.researchgate.net/profile/Johannes-Gruber-2") %>%
req_headers(
authority = "www.researchgate.net",
accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
`accept-language` = "en-GB,en;q=0.9",
`cache-control` = "max-age=0",
cookie = "[Redacted]",
`sec-ch-ua` = "\"Chromium\";v=\"115\", \"Not/A)Brand\";v=\"99\"",
`sec-ch-ua-mobile` = "?0",
`sec-ch-ua-platform` = "\"Linux\"",
`sec-fetch-dest` = "document",
`sec-fetch-mode` = "navigate",
`sec-fetch-site` = "cross-site",
`sec-fetch-user` = "?1",
`upgrade-insecure-requests` = "1",
`user-agent` = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36"
) %>%
req_perform()
request("https://www.researchgate.net/profile/Johannes-Gruber-2") |>
req_headers(
authority = "www.researchgate.net",
accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
`accept-language` = "en-GB,en;q=0.9",
`cache-control` = "max-age=0",
cookie = "[Redacted]",
`sec-ch-ua` = "\"Chromium\";v=\"115\", \"Not/A)Brand\";v=\"99\"",
`sec-ch-ua-mobile` = "?0",
`sec-ch-ua-platform` = "\"Linux\"",
`sec-fetch-dest` = "document",
`sec-fetch-mode` = "navigate",
`sec-fetch-site` = "cross-site",
`sec-fetch-user` = "?1",
`upgrade-insecure-requests` = "1",
`user-agent` = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36"
) |>
req_perform()

This used to work quite well when I prepared the slides, but suddenly stopped working over the weekend. So I removed the rest of the slides about it…
The agenda is rendered inside an element with class="agenda-content", but requesting the page directly fails again:

Error in open.connection(x, "rb"): HTTP error 403.
cURL call

curl_translate("curl 'https://whova.com/xems/apis/event_webpage/agenda/public/get_agendas/?event_id=JcQAdK91J0qLUtNxOYUVWFMTUuQgIg3Xj6VIeeyXVR4%3D' \
-H 'Accept: application/json, text/plain, */*' \
-H 'Accept-Language: en-GB,en-US;q=0.9,en;q=0.8' \
-H 'Cache-Control: no-cache' \
-H 'Connection: keep-alive' \
-H 'Pragma: no-cache' \
-H 'Referer: https://whova.com/embedded/event/JcQAdK91J0qLUtNxOYUVWFMTUuQgIg3Xj6VIeeyXVR4%3D/' \
-H 'Sec-Fetch-Dest: empty' \
-H 'Sec-Fetch-Mode: cors' \
-H 'Sec-Fetch-Site: same-origin' \
-H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36' \
-H 'sec-ch-ua: \"Chromium\";v=\"115\", \"Not/A)Brand\";v=\"99\"' \
-H 'sec-ch-ua-mobile: ?0' \
-H 'sec-ch-ua-platform: \"Linux\"' \
--compressed")

request("https://whova.com/xems/apis/event_webpage/agenda/public/get_agendas/?event_id=JcQAdK91J0qLUtNxOYUVWFMTUuQgIg3Xj6VIeeyXVR4%3D") %>%
req_headers(
Accept = "application/json, text/plain, */*",
`Accept-Language` = "en-GB,en-US;q=0.9,en;q=0.8",
`Cache-Control` = "no-cache",
Connection = "keep-alive",
Pragma = "no-cache",
Referer = "https://whova.com/embedded/event/JcQAdK91J0qLUtNxOYUVWFMTUuQgIg3Xj6VIeeyXVR4%3D/",
`Sec-Fetch-Dest` = "empty",
`Sec-Fetch-Mode` = "cors",
`Sec-Fetch-Site` = "same-origin",
`User-Agent` = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
`sec-ch-ua` = "\"Chromium\";v=\"115\", \"Not/A)Brand\";v=\"99\"",
`sec-ch-ua-mobile` = "?0",
`sec-ch-ua-platform` = "\"Linux\""
) %>%
req_perform()
ica_data <- request("https://whova.com/xems/apis/event_webpage/agenda/public/get_agendas/?event_id=JcQAdK91J0qLUtNxOYUVWFMTUuQgIg3Xj6VIeeyXVR4%3D") |>
req_headers(
Accept = "application/json, text/plain, */*",
`Accept-Language` = "en-GB,en-US;q=0.9,en;q=0.8",
`Cache-Control` = "no-cache",
Connection = "keep-alive",
Pragma = "no-cache",
Referer = "https://whova.com/embedded/event/JcQAdK91J0qLUtNxOYUVWFMTUuQgIg3Xj6VIeeyXVR4%3D/",
`Sec-Fetch-Dest` = "empty",
`Sec-Fetch-Mode` = "cors",
`Sec-Fetch-Site` = "same-origin",
`User-Agent` = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
`sec-ch-ua` = "\"Chromium\";v=\"115\", \"Not/A)Brand\";v=\"99\"",
`sec-ch-ua-mobile` = "?0",
`sec-ch-ua-platform` = "\"Linux\""
) |>
req_perform() |>
resp_body_json()

$id
[1] 3113186
$name
[1] "Race, Ethnicity, and Religion: Media Coverage, Messages, and Reactions"
$event_id
[1] "aic1_202305"
$start_time
[1] "09:00"
$end_time
[1] "10:15"
$calendar_stime
[1] "2023-05-28 09:00:00"
$calendar_etime
[1] "2023-05-28 10:15:00"
$place
[1] "M - Chestnut East"
$desc
[1] "<br /><br /><b>Papers: </b><br />Campaign Outreach or (His)Pandering?: Politician Spanish Usage in Media and Latino Voters<br /><i>Guadalupe Madrigal, U of Missouri - Columbia</i><br /><i>Angela Ocampo, U of Texas at Austin</i><br /><br />When a Pandemic Converted to Islamophobia: Indian News in the Time of Covid-19<br /><i>Arshad Amanullah, National U of Singapore</i><br /><i>Arif Nadaf, Islamic U of Science & Technology, Kashmir, India</i><br /><i>Taberez Neyazi, National U of Singapore</i><br /><br />Blaming Asians for Coronavirus: The Role of Valenced Framing and Discrete Emotions in Hostile Media Effect<br /><i>Juan Liu, Towson U</i><br /><br />Partisanship Supersedes Race: Effects of Discussant Race and Partisanship on Whites’ Willingness to Engage in Race-Specific Conversations<br /><i>Osei Appiah, The Ohio State U</i><br /><i>William Eveland, The Ohio State U</i><br /><i>Christina Henry, The Ohio State U</i><br /><br />Examining Racial Differences in Concerns About Online Polarization<br /><i>Cara Schumann, U of North Carolina at Chapel Hill</i><br /><i>Shannon McGregor, U of North Carolina at Chapel Hill</i> <a href='https://ica2023.cadmore.media/object/451982' style='text-decoration: none; background-color: #789F90; color: #FFFFFF; padding: 5px 10px; border: 1px solid #789F90; border-radius: 15px;'>Open Session</a><br /><br />"
$extra
$extra$docs
list()
$extra$live_stream
$extra$live_stream$url
[1] ""
$extra$recorded_video
$extra$recorded_video$url
[1] ""
$extra$order
[1] 791
$extra$type
[1] "Session"
$extra$rate_enabled
[1] TRUE
$extra$session_feedback_enable
[1] TRUE
$docs
list()
$session_order
[1] 791
$session_feedback_enable
[1] TRUE
$live_stream
$live_stream$url
[1] ""
$recorded_video
$recorded_video$url
[1] ""
$upload_video
NULL
$simulive_upload_video
NULL
$speaker
named list()
$expand
[1] "yes"
$speaker_label
[1] "Session chair"
$type
[1] 1
$sponsors
list()
$programs
list()
$tracks
$tracks[[1]]
$tracks[[1]]$name
[1] "In Person"
$tracks[[1]]$id
[1] 539417
$tracks[[1]]$color
[1] "#5C6BC0"
$tracks[[2]]
$tracks[[2]]$name
[1] "Political Communication"
$tracks[[2]]$id
[1] 540044
$tracks[[2]]$color
[1] "#a15284"
$tags
list()
I could not come up with a better method so far. The only way to extract the data is with a nested for loop going through all days and all entries in the object and looking for elements called “sessions”.
library(tidyverse, warn.conflicts = FALSE)
sessions <- list()
for (day in 1:5) {
times <- ica_data[["data"]][["agenda"]][[day]][["time_ranges"]]
for (l_one in seq_along(pluck(times))) {
for (l_two in seq_along(pluck(times, l_one))) {
for (l_three in seq_along(pluck(times, l_one, l_two))) {
for (l_four in seq_along(pluck(times, l_one, l_two, l_three))) {
session <- pluck(times, l_one, l_two, l_three, l_four, "sessions", 1)
id <- pluck(session, "id")
if (!is.null(id)) {
id <- as.character(id)
sessions[[id]] <- session
}
}
}
}
}
}

ica_data_df <- tibble(
panel_id = map_int(sessions, "id"),
panel_name = map_chr(sessions, "name"),
time = map_chr(sessions, "calendar_stime"),
desc = map_chr(sessions, function(s) pluck(s, "desc", .default = NA))
)
ica_data_df

# A tibble: 881 × 4
panel_id panel_name time desc
<int> <chr> <chr> <chr>
1 3113155 PRECONFERENCE: Games and the (Playful) Future of Commun… 2023… "Rec…
2 3113156 PRECONFERENCE: Generation Z and Global Communication 2023… "Gen…
3 3113166 PRECONFERENCE: Nothing About Us, Without Us: Authentic … 2023… "Thi…
4 3113172 PRECONFERENCE: Reimagining the Field of Media, War and … 2023… "As …
5 3113175 PRECONFERENCE: The Legacies of Elihu Katz 2023… "Eli…
6 3112705 Human-Machine Preconference Breakout (room 2) 2023… <NA>
7 3113080 New Avoidance Preconference Breakout (room 2) 2023… <NA>
8 3113150 PRECONFERENCE: 12th Annual Doctoral Consortium of the C… 2023… "The…
9 3113154 PRECONFERENCE: Ethics of Critically Interrogating and R… 2023… "The…
10 3113158 PRECONFERENCE: Human-Machine Communication: Authenticit… 2023… "The…
# ℹ 871 more rows
Finally we want to parse the HTML in the description column.
3113023
"<br /><br /><b>Participants: </b><br /><b><i>(Chairs) </i></b>Wayne Xu, U of Massachusetts Amherst<br /><br /><b>Papers: </b><br />Disentangling the Longitudinal Relationship Between Social Media Use, Political Expression and Political Participation: What Do We Really Know?<br /><i>Jörg Matthes, U of Vienna</i><br /><i>Andreas Nanz, U of Vienna</i><br /><i>Marlis Stubenvoll, U of Vienna</i><br /><i>Ruta Kaskeleviciute, U of Vienna</i><br /><br />Political Discussions on Russian YouTube: How Did They Change Since the Start of the War in Ukraine?<br /><i>Ekaterina Romanova, U of Florida</i><br /><br />Perceptions of and Reactions to Different Types of Incivility in Public Online Discussions: Results of an Online Experiment<br /><i>Marike Bormann, Unviersity of Düsseldorf</i><br /><i>Dominique Heinbach, Heinrich-Heine-U</i><br /><i>Jan Kluck, U of Duisburg-Essen</i><br /><i>Marc Ziegele, Heinrich Heine U</i><br /><br />When Trust in AI Mediates: AI News Use, Public Discussion, and Civic Participation<br /><i>Seungahn Nah, U of Florida</i><br /><i>Chun Shao, Arizona State U</i><br /><i>Ekaterina Romanova, U of Florida</i><br /><i>Gwiwon Nam, U of Florida</i><br /><i>Fanjue Liu, U of Florida</i> <a href='https://ica2023.cadmore.media/object/451094' style='text-decoration: none; background-color: #789F90; color: #FFFFFF; padding: 5px 10px; border: 1px solid #789F90; border-radius: 15px;'>Open Session</a><br /><br />"
We can inspect one of the descriptions using the same function as in session 3:
I wrote another function for this. You can check some of the panels using the browser: check_in_browser(ica_data_df$desc[100]).
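A plausible sketch of check_in_browser (the implementation is an assumption): it writes the HTML to a temporary file and opens it in the default browser:

```r
# hypothetical reconstruction of check_in_browser
check_in_browser <- function(html) {
  f <- tempfile(fileext = ".html")
  writeLines(as.character(html), f)
  browseURL(f)
}
```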
pull_papers <- function(desc) {
# we extract the html code starting with the papers line
papers <- str_extract(desc, "<b>Papers: </b>.+$") |>
str_remove("<b>Papers: </b><br />") |>
# we split the html by double line breaks, since it is not properly formatted as paragraphs
strsplit("<br /><br />", fixed = TRUE) |>
pluck(1)
# if there is no html code left, just return NAs
if (all(is.na(papers))) {
return(list(list(paper_title = NA, authors = NA)))
} else {
# otherwise we loop through each paper
map(papers, function(t) {
html <- read_html(t)
# first line is the title
title <- html |>
html_text2() |>
str_extract("^.+\n")
# at least the authors are formatted in italics
authors <- html_elements(html, "i") |>
html_text2()
list(paper_title = title, authors = authors)
})
}
}

Now we have all the information we wanted:
ica_data_df_tidy <- ica_data_df |>
slice(-613) |>
mutate(papers = map(desc, pull_papers)) |>
unnest(papers) |>
unnest_wider(papers) |>
unnest(authors) |>
select(-desc) |>
filter(!is.na(authors))
ica_data_df_tidy

# A tibble: 8,169 × 5
panel_id panel_name time paper_title authors
<int> <chr> <chr> <chr> <chr>
1 3113249 The Powers of Platforms 2023-05-2… "Serve the… Changw…
2 3113249 The Powers of Platforms 2023-05-2… "Serve the… Ziyi W…
3 3113249 The Powers of Platforms 2023-05-2… "Serve the… Joel G…
4 3113249 The Powers of Platforms 2023-05-2… "Empowered… Andrea…
5 3113249 The Powers of Platforms 2023-05-2… "Empowered… Jacob …
6 3113249 The Powers of Platforms 2023-05-2… "The Rise … Guy Ho…
7 3113249 The Powers of Platforms 2023-05-2… "Google Ne… Lucia …
8 3113249 The Powers of Platforms 2023-05-2… "Google Ne… Mathia…
9 3113249 The Powers of Platforms 2023-05-2… "Google Ne… Amalia…
10 3112411 Affiliate Journals Top Papers Session 2023-05-2… "One Year … Eloria…
# ℹ 8,159 more rows
Open the ICA site in your browser and inspect the network traffic. Can you identify the call to the programme json?
I excluded panel 613 since the function fails on that one. Investigate what the problem is.
This doesn’t look too bad! The link even shows a selected_session_id, which could be handy:
https://convention2.allacademic.com/one/apsa/apsa23/index.php?cmd=Online+Program+View+Session&selected_session_id=2069362&PHPSESSID=9s7asg63fpouugut6m5m2vj36r
library(rvest)
html <- read_html("https://convention2.allacademic.com/one/apsa/apsa23/index.php?cmd=Online+Program+View+Selected+Day+Submissions&selected_day=2023-09-01&program_focus=browse_by_day_submissions")
panels <- html |>
html_elements("li a") |>
html_element("p") |>
html_text2()
panel_links <- html |>
html_elements("li a") |>
html_attr("href")
tibble(panels, panel_links)

# A tibble: 36 × 2
panels panel_links
<chr> <chr>
1 Search https://convention2.allacademic.com//o…
2 Browse By Day https://convention2.allacademic.com//o…
3 Browse By Time https://convention2.allacademic.com//o…
4 Browse By Person https://convention2.allacademic.com//o…
5 Browse By Mini-Conference https://convention2.allacademic.com//o…
6 Browse By Division https://convention2.allacademic.com//o…
7 Browse By Session or Event Type https://convention2.allacademic.com//o…
8 Browse Sessions by Fields of Interest https://convention2.allacademic.com//o…
9 Browse Papers by Fields of Interest https://convention2.allacademic.com//o…
10 Search Tips https://convention2.allacademic.com/on…
# ℹ 26 more rows
These do not look like the panel links! What’s going on?!
The object html is not that easy to evaluate since it contains html code not made for human eyes and the output is truncated while printing.
We can adapt the function we used before to convert the rvest object to a character object and display the content of the object in a browser:
Following the same strategy as before:
We translate the call with httr2::curl_translate() (make sure to escape the mischievous "):

httr2::curl_translate("curl 'https://convention2.allacademic.com/one/apsa/apsa23/index.php?cmd=Online+Program+View+Selected+Day+Submissions&selected_day=2023-08-31&program_focus=browse_by_day_submissions&PHPSESSID=fvjf6ltd4o45kgpv2occcrr0al' \
-H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7' \
-H 'Accept-Language: en-GB,en-US;q=0.9,en;q=0.8' \
-H 'Cache-Control: no-cache' \
-H 'Connection: keep-alive' \
-H 'Cookie: 9s7asg63fpouugut6m5m2vj36r[msg]=e52640799a6bbcebac16c0205ffc2cd9; fvjf6ltd4o45kgpv2occcrr0al[msg]=999aa7691451c5d15ddf91ee0a902f3b; _ga=GA1.2.2046361133.1690277724; _gid=GA1.2.499473362.1690277724; monster[/one/apsa/apsa23/][fvjf6ltd4o45kgpv2occcrr0al][created]=1690532022; _gat=1; _gat_extraTracker=1; _ga_79KQXM4T08=GS1.2.1690530570.6.1.1690532023.0.0.0; _ga_JWPT5JHJ1E=GS1.2.1690530570.6.1.1690532024.0.0.0' \
-H 'Pragma: no-cache' \
-H 'Referer: https://convention2.allacademic.com/one/apsa/apsa23/index.php?cmd=Online+Program+Load+Focus&program_focus=browse_by_day_submissions&PHPSESSID=fvjf6ltd4o45kgpv2occcrr0al' \
-H 'Sec-Fetch-Dest: document' \
-H 'Sec-Fetch-Mode: navigate' \
-H 'Sec-Fetch-Site: same-origin' \
-H 'Sec-Fetch-User: ?1' \
-H 'Upgrade-Insecure-Requests: 1' \
-H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36' \
-H 'sec-ch-ua: \"Chromium\";v=\"115\", \"Not/A)Brand\";v=\"99\"' \
-H 'sec-ch-ua-mobile: ?0' \
-H 'sec-ch-ua-platform: \"Linux\"' \
--compressed")

request("https://convention2.allacademic.com/one/apsa/apsa23/index.php?cmd=Online+Program+View+Selected+Day+Submissions&selected_day=2023-08-31&program_focus=browse_by_day_submissions&PHPSESSID=fvjf6ltd4o45kgpv2occcrr0al") %>%
req_headers(
Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
`Accept-Language` = "en-GB,en-US;q=0.9,en;q=0.8",
`Cache-Control` = "no-cache",
Connection = "keep-alive",
Cookie = "9s7asg63fpouugut6m5m2vj36r[msg]=e52640799a6bbcebac16c0205ffc2cd9; fvjf6ltd4o45kgpv2occcrr0al[msg]=999aa7691451c5d15ddf91ee0a902f3b; _ga=GA1.2.2046361133.1690277724; _gid=GA1.2.499473362.1690277724; monster[/one/apsa/apsa23/][fvjf6ltd4o45kgpv2occcrr0al][created]=1690532022; _gat=1; _gat_extraTracker=1; _ga_79KQXM4T08=GS1.2.1690530570.6.1.1690532023.0.0.0; _ga_JWPT5JHJ1E=GS1.2.1690530570.6.1.1690532024.0.0.0",
Pragma = "no-cache",
Referer = "https://convention2.allacademic.com/one/apsa/apsa23/index.php?cmd=Online+Program+Load+Focus&program_focus=browse_by_day_submissions&PHPSESSID=fvjf6ltd4o45kgpv2occcrr0al",
`Sec-Fetch-Dest` = "document",
`Sec-Fetch-Mode` = "navigate",
`Sec-Fetch-Site` = "same-origin",
`Sec-Fetch-User` = "?1",
`Upgrade-Insecure-Requests` = "1",
`User-Agent` = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
`sec-ch-ua` = "\"Chromium\";v=\"115\", \"Not/A)Brand\";v=\"99\"",
`sec-ch-ua-mobile` = "?0",
`sec-ch-ua-platform` = "\"Linux\""
) %>%
req_perform()
We run the call with httr2 and check if we get the right content:

html <- request("https://convention2.allacademic.com/one/apsa/apsa23/index.php?cmd=Online+Program+View+Selected+Day+Submissions&selected_day=2023-08-31&program_focus=browse_by_day_submissions&PHPSESSID=fvjf6ltd4o45kgpv2occcrr0al") |>
req_headers(
Cookie = "9s7asg63fpouugut6m5m2vj36r[msg]=e52640799a6bbcebac16c0205ffc2cd9; fvjf6ltd4o45kgpv2occcrr0al[msg]=999aa7691451c5d15ddf91ee0a902f3b; _ga=GA1.2.2046361133.1690277724; _gid=GA1.2.499473362.1690277724; monster[/one/apsa/apsa23/][fvjf6ltd4o45kgpv2occcrr0al][created]=1690532022; _gat=1; _gat_extraTracker=1; _ga_79KQXM4T08=GS1.2.1690530570.6.1.1690532023.0.0.0; _ga_JWPT5JHJ1E=GS1.2.1690530570.6.1.1690532024.0.0.0",
Referer = "https://convention2.allacademic.com/one/apsa/apsa23/index.php?cmd=Online+Program+Load+Focus&program_focus=browse_by_day_submissions&PHPSESSID=fvjf6ltd4o45kgpv2occcrr0al"
) |>
req_perform() |>
# we add this part to extract the html from the response
resp_body_html()

We can generalise this httr2 call to make it usable for requesting other data. After some investigation, I noticed that the API returns the right information when it has two things in the call:
the right cookies
the session ID (the PHPSESSID query parameter)

request_apsa <- function(url,
sess_id = NULL,
cookies = "9s7asg63fpouugut6m5m2vj36r[msg]=e52640799a6bbcebac16c0205ffc2cd9; fvjf6ltd4o45kgpv2occcrr0al[msg]=999aa7691451c5d15ddf91ee0a902f3b; _ga=GA1.2.2046361133.1690277724; _gid=GA1.2.499473362.1690277724; monster[/one/apsa/apsa23/][fvjf6ltd4o45kgpv2occcrr0al][created]=1690532022; _gat=1; _gat_extraTracker=1; _ga_79KQXM4T08=GS1.2.1690530570.6.1.1690532023.0.0.0; _ga_JWPT5JHJ1E=GS1.2.1690530570.6.1.1690532024.0.0.0") {
# extract the session id from the URL if not given
if (is.null(sess_id)) {
sess_id <- str_extract(url, "&PHPSESSID=[a-z0-9]+(&|$)")
}
referer <- paste0(
"https://convention2.allacademic.com/one/apsa/apsa23/index.php?cmd=Online+Program+Load+Focus&program_focus=browse_by_day_submissions",
sess_id
)
request(url) |>
req_headers(
Referer = referer,
Cookie = cookies
) |>
# Let's set a cautious rate in case they check for scraping
req_throttle(6 / 60) |>
req_perform() |>
resp_body_html()
}

Let’s test this on a panel:
Luckily, the HTML is quite clean and easy to parse with the tools we’ve learned already:
panel_title <- panel_3_html |>
html_element("h3") |>
html_text2()
panel_description <- panel_3_html |>
html_element("blockquote") |>
html_text2()
paper_urls <- panel_3_html |>
html_elements("li a") |>
html_attr("href")
paper_description <- panel_3_html |>
html_elements("li a") |>
html_text2()
tibble(paper_description, paper_urls) |>
# we collected some trash, but can filter it out easily using the URL
filter(str_detect(paper_urls, "selected_paper_id=")) |>
# We separate paper title and authors from each other
separate(paper_description, into = c("paper", "authors"), sep = " - ") |>
# If there are several authors they are divided by ; (we split them up)
mutate(author = strsplit(authors, split = "; ")) |>
# pull the list out into a long format
unnest(author) |>
# And add some information from above
mutate(panel_title = panel_title,
paper_description = panel_description)

# A tibble: 5 × 6
paper authors paper_urls author panel_title paper_description
<chr> <chr> <chr> <chr> <chr> <chr>
1 "Anger-Driven Misinfo… Cengiz… https://c… Cengi… Immigratio… These papers con…
2 "Anger-Driven Misinfo… Cengiz… https://c… Sofia… Immigratio… These papers con…
3 "Kids in Cages: When … Frank … https://c… Frank… Immigratio… These papers con…
4 "Kids in Cages: When … Frank … https://c… Allis… Immigratio… These papers con…
5 "From “Illegal” to “U… Jacob … https://c… Jacob… Immigratio… These papers con…
We combine the request for a panel’s html and the parsing in one function:
scrape_panel <- function(url) {
sess_id <- str_extract(url, "(?<=selected_session_id=)\\d+")
message("Requesting session ", sess_id)
# request the URL with our request function
html <- request_apsa(url)
# Running the parser
title <- html |>
html_element("h3") |>
html_text2()
description <- html |>
html_element("blockquote") |>
html_text2()
paper_urls <- html |>
html_elements("li a") |>
html_attr("href")
paper_description <- html |>
html_elements("li a") |>
html_text2()
tibble(paper_description, paper_urls) |>
filter(str_detect(paper_urls, "selected_paper_id=")) |>
separate(paper_description, into = c("paper", "authors"), sep = " - ") |>
mutate(author = strsplit(authors, split = ";")) |>
unnest(author) |>
mutate(panel_title = title,
panel_description = description)
}
scrape_panel(panels$panel_links[4])

# A tibble: 7 × 6
paper authors paper_urls author panel_title panel_description
<chr> <chr> <chr> <chr> <chr> <chr>
1 Winning Elections wit… Shusei… https://c… "Shus… Dominant P… "An enduring puz…
2 Winning Elections wit… Shusei… https://c… " Yus… Dominant P… "An enduring puz…
3 Winning Elections wit… Shusei… https://c… " Shi… Dominant P… "An enduring puz…
4 Winning Elections wit… Shusei… https://c… " Dan… Dominant P… "An enduring puz…
5 A Theory of Group-Bas… Amy Lo… https://c… "Amy … Dominant P… "An enduring puz…
6 In-Group Anger or Out… Shikha… https://c… "Shik… Dominant P… "An enduring puz…
7 Reelection Can Increa… Lucia … https://c… "Luci… Dominant P… "An enduring puz…
scrape_panel <- function(url,
cache_dir = "../data/apsa2023/") {
# the default is an empty file name
f_name <- ""
# If cache_dir is not NULL, a file name is constructed
if (!is.null(cache_dir)) {
# we make sure that the cache folder is created if it does not exist
dir.create(cache_dir, showWarnings = FALSE)
# we extract the session ID from the URL
sess_id <- str_extract(url, "(?<=selected_session_id=)\\d+")
# and use it to construct a file path for saving
f_name <- file.path(cache_dir, paste0(sess_id, ".rds"))
}
# if the cache file already exists, we can skip this session :)
if (!file.exists(f_name)) {
message("Requesting session ", sess_id)
html <- request_apsa(url)
title <- html |>
html_element("h3") |>
html_text2()
description <- html |>
html_element("blockquote") |>
html_text2()
paper_urls <- html |>
html_elements("li a") |>
html_attr("href")
paper_description <- html |>
html_elements("li a") |>
html_text2()
out <- tibble(paper_description, paper_urls) |>
filter(str_detect(paper_urls, "selected_paper_id=")) |>
separate(paper_description, into = c("paper", "authors"), sep = " - ") |>
mutate(author = strsplit(authors, split = ";")) |>
unnest(author) |>
mutate(panel_title = title,
panel_description = description)
if (!is.null(cache_dir)) {
saveRDS(out, f_name)
}
} else {
# If the file already exists, we read the cached panel data
out <- readRDS(f_name)
}
out
}
scrape_panel(panels$panel_links[4])

# A tibble: 7 × 6
paper authors paper_urls author panel_title panel_description
<chr> <chr> <chr> <chr> <chr> <chr>
1 Winning Elections wit… Shusei… https://c… "Shus… Dominant P… "An enduring puz…
2 Winning Elections wit… Shusei… https://c… " Yus… Dominant P… "An enduring puz…
3 Winning Elections wit… Shusei… https://c… " Shi… Dominant P… "An enduring puz…
4 Winning Elections wit… Shusei… https://c… " Dan… Dominant P… "An enduring puz…
5 A Theory of Group-Bas… Amy Lo… https://c… "Amy … Dominant P… "An enduring puz…
6 In-Group Anger or Out… Shikha… https://c… "Shik… Dominant P… "An enduring puz…
7 Reelection Can Increa… Lucia … https://c… "Luci… Dominant P… "An enduring puz…
Much quicker, since I’ve done this before!
We loop over the days of APSA to collect all links:
days <- seq(as.Date("2023-08-30"), as.Date("2023-09-03"), 1)
panel_links <- map(days, function(d) {
html <- request_apsa(
paste0("https://convention2.allacademic.com/one/apsa/apsa23/index.php?cmd=Online+Program+View+Selected+Day+Submissions&selected_day=",
d,
"&program_focus=browse_by_day_submissions&PHPSESSID=fvjf6ltd4o45kgpv2occcrr0al"
))
html |>
html_elements("li a") |>
html_attr("href") |>
str_subset("session_id")
}) |>
unlist()
length(panel_links)

[1] 1574
And now we iterate over these links to collect all panel data:
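A sketch of this step, reusing scrape_panel() from above (the object name apsa_data is my choice; failed panels would need extra handling):

```r
# loop over all panel links and combine the per-panel tibbles into one
apsa_data <- map(panel_links, scrape_panel) |>
  bind_rows()
```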
We make sure to save the combined data:
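A sketch, assuming the combined tibble is called apsa_data; the file path follows the ../data/ convention used for the cache, but is an assumption:

```r
# save the combined data so we never have to scrape everything again
saveRDS(apsa_data, "../data/apsa2023.rds")
```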
And let’s check the most prolific authors again:
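A sketch of the count that produces the table below, assuming the combined tibble is called apsa_data:

```r
# tally papers per author, most prolific first
apsa_data |>
  count(author, sort = TRUE)
```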
# A tibble: 7,309 × 2
author n
<chr> <int>
1 " Jonathan Nagler, New York University" 8
2 " Joshua A. Tucker, New York University" 8
3 " Baekkwan Park, University of Missouri" 5
4 " Carl Henrik Knutsen, Department of Political Science, University of … 5
5 " Fabrizio Gilardi, University of Zurich" 5
6 " Peter Loewen, University of Toronto" 5
7 " Aykut Ozturk, University of Glasgow" 4
8 " Beatrice Magistro, California Institute of Technology" 4
9 " Geoffrey Sheagley, University of Georgia" 4
10 " Maël Dominique Kubli, University of Zurich" 4
# ℹ 7,299 more rows
Use your own cookies and session ID to run the function on the page with the URLs.
Check the German news website https://www.zeit.de/. It has an interesting quirk that prevents you from scraping the content of the site. What is it and how could you get around it?
Save some information about the session for reproducibility.
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: EndeavourOS
Matrix products: default
BLAS: /usr/lib/libblas.so.3.11.0
LAPACK: /usr/lib/liblapack.so.3.11.0
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=nl_NL.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=nl_NL.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=nl_NL.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Amsterdam
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0 dplyr_1.1.2
[5] purrr_1.0.1 readr_2.1.4 tidyr_1.3.0 tibble_3.2.1
[9] ggplot2_3.4.2 tidyverse_2.0.0 httr2_0.2.3 rvest_1.0.3
loaded via a namespace (and not attached):
[1] gtable_0.3.3 jsonlite_1.8.7 selectr_0.4-2 compiler_4.3.1
[5] tidyselect_1.2.0 xml2_1.3.5 scales_1.2.1 yaml_2.3.7
[9] fastmap_1.1.1 R6_2.5.1 generics_0.1.3 curl_5.0.1
[13] knitr_1.43 munsell_0.5.0 pillar_1.9.0 tzdb_0.4.0
[17] rlang_1.1.1 utf8_1.2.3 stringi_1.7.12 xfun_0.39
[21] timechange_0.2.0 cli_3.6.1 withr_2.5.0 magrittr_2.0.3
[25] digest_0.6.33 grid_4.3.1 rstudioapi_0.15.0 rappdirs_0.3.3
[29] hms_1.1.3 lifecycle_1.0.3 vctrs_0.6.3 evaluate_0.21
[33] glue_1.6.2 codetools_0.2-19 fansi_1.0.4 colorspace_2.1-0
[37] rmarkdown_2.23 httr_1.4.6 tools_4.3.1 pkgconfig_2.0.3
[41] htmltools_0.5.5